1 Prerequisites

Before we even start, it is of crucial importance to first understand deeply linear and logistic regression. This is because (as we will see in later discussion) they are the very basic components in a general neural network model.

1.1 Linear Regression

A linear model can be written in a matrix form

\[ Y = X\beta + \epsilon, \] where \(Y\) is the output or label vector of length \(N\) (number of observations), \(X\) is the input feature matrix (referred to as the design matrix in statistics) with dimension \(N\) by \(P\) (number of features), \(\beta\) is the model weights/coefficients (a column vector of length \(P\)) for which we’d like to solve, \(\epsilon\) is the model residual or error vector.

1.1.1 Ordinary Least Squares

The classical way to solve for \(\beta\) is ordinary least squares. The idea of OLS is find out the weights that minimize the mean of squared model errors:

\[ \begin{aligned} \mbox{mse (loss)} &= \frac{1}{N}\sum_i\epsilon^2_i \\ &= \frac{1}{N}\sum_i(y_i - \beta x_i)^2, \end{aligned} \]

where \(y_i\) and \(x_i\) is the \(i\)-th observation. The first-order condition (requiring that the first-order derivative w.r.t. weights are zero) gives us the OLS solution for model weights \(\beta\) analytically (in matrix notation):

\[ \begin{equation} \label{eq:ols} \hat{\beta} = (X'X)^{-1}X'Y. \end{equation} \]

Let’s consider a toy model with only one non-constant feature:

\[ y_i = 6 + 4x_i + \epsilon_i. \]

In this model the outcome \(y\) is determined by a bias term plus a single variable \(x\), with an independently distributed noise term \(\epsilon \sim \mbox{Normal}(0, 1)\).

Create some random data generated from this model:

[[ 1.         -0.46820879]
 [ 1.         -0.82282485]
 [ 1.         -0.0653801 ]
 [ 1.         -0.71336192]
 [ 1.          0.90635089]
 [ 1.          0.76623673]
 [ 1.          0.82605407]
 [ 1.         -1.32368279]
 [ 1.         -1.75244452]
 [ 1.          1.00244907]]

Without consideration of the noise term, the OLS estimator will solve precisely for the true model weights:

[6. 4.]

Of course in the real world the noise term cannot be determined and usually the feature alone cannot explain entirely the variation in the outcome. This means that what we actually estimate will be the expected value of model weights:

[6.01132196 3.96247175]

By the Law of Large Number and Central Limit Theorem, the OLS estimator will converge in probability to the true model weights and distributed asymptotically Normal in large sample.1

The issue of the above approach is that equation \(\eqref{eq:ols}\) is not numerically stable when it comes to large-scale application where we may have lots of observations and lots of features. One very useful solution to solve the estimator numerically in large-scale application is the gradient descent approach.

1.1.2 Gradient Descent with Mean Squared Error

Instead of solving the first-order condition analytically, we can do it numerically. Gradient descent is a 1st-order optimization technique to find local optimum of a given function.

In the model training exercise our target function is the loss so the optimization problem is:

\[ \operatorname*{argmin}_\beta \mbox{Loss} = \frac{1}{N}\sum_i(y_i - \beta x_i)^2. \]

That is, we’d like to figure out model weights that minimize the loss which is defined by the mean squared errors when the model is a regression model.

The idea of gradient descent is to

  1. Derive the functional form of the gradient of loss w.r.t. to all weights
  2. Initialize all model weights randomly
  3. Calculate the gradient value using the actual data and the current value of weights
  4. Update the weights by (partially) the amount of gradient just calculated
  5. Repeat 3 and 4 until the resulting gradients become small enough

Let’s use the toy example to actually implement a gradent descent optimizer from scratch. First we re-write the loss function explicitly with our setup of one coefficient with a constant (\(\beta = [\beta_0, \beta_1]\)):

\[ \mbox{Loss} = \frac{1}{N}\sum_i\big[y_i-(\beta_0 + \beta_1x_i)\big]^2. \]

Now the gradient (or equivalently the 1st-order derivative) w.r.t. to weights will be:

\[ \begin{aligned} \frac{\partial\mbox{Loss}}{\partial\beta_0} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big], \\ \frac{\partial\mbox{Loss}}{\partial\beta_1} &= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big]x_i. \end{aligned} \]

The corresponding python function can be coded as:

If we set the above equations to zero we can solve for \(\beta_0\) and \(\beta_1\) analytically and the solution will be exactly just equation \(\eqref{eq:ols}\). But as we just pointed out it suffers from numerical stability issue.

The minimum implementation of our gradient descent optimizer is just a few lines:

[6.01132196 3.96247175]

As we can see the result is very closed to our analytical solution.

On Learning Rate

Learning rate is a hyper-parameter for gradient descent optimizer. The gradient update to our model weights is scaled down by the learning rate to make sure convergence. Too large the learning rate will explode the gradient. Too small the learning rate will slow down the convergence and sometimes result in the optimizer trapped at local sub-optimum.

Let’s re-write our gradient descent optimizer to also track the loss from each training step. And we use the same initialization for a fair comparison.

Now we run the optimization with a variety of different learning rates. For illustration purpose we will only run a few steps.

Learning Rate 0.001 | Estimate: [0.18155223 0.12379996]
Learning Rate  0.01 | Estimate: [1.60013258 1.08670659]
Learning Rate  0.05 | Estimate: [4.81547513 3.22246947]
Learning Rate   0.1 | Estimate: [5.81500752 3.85150789]
Learning Rate     1 | Estimate: [19.8158439  18.42703331]

The result suggests that a learning rate of 1 is too large for our problem. The gradient explodes which make our solution diverge. And a lower learning rate in general converges slower to the optimum. Number of examples used to calculate the gradient also will affect the convergence behavior. In general if the sample size is too small a smaller learning rate should be used to avoid gradient explosion.

This can be more clearly seen if we plot the trace of our training losses:

If the loss doesn’t decrease over training iteration, it is a signal that something is wrong with our model.

Unlike our toy implementation, in modern implementation of any numerical optimizer there will be a lot of techniques to do the best to avoid convergence failure. But it is the model developer’s responsibility to diagnose the training behavior before anything is delivered to the stakeholder. Checking the dynamics of loss is usually the first and quick step to examine whether the training task is functioning as expected.

Batch Gradient Descent

The vanilla gradient descent optimizer we just implemented has one issue. Since for each update it needs to traverse over the entire dataset, it becomes too slow when it comes to large dataset.

Batch gradient descent is too overcome this issue. Instead of calculate the gradient using the entire dataset, we can use only a random subset of it. It will be less precise but statistically the result will be consistent.

Let’s implement a batch gradient descent optimizer:

[5.98189269 3.94839182]

Batch optimizer is currently the best practice of training neural networks in large scale application. The batch size depends on the actual application but usually ranges from 8 to 1024.

Stocahstic Gradient Descent

If we set the batch size as exactly 1, i.e., to calculate the gradient update only use one data observation at a time, this is referred to as stochastic gradient descent.

Here we introduce the term epoch: One epoch is for the optimizer to traverse the entire dataset once (no matter how many examples are used to calculate the gradient in each update). Number of epochs can be considered as another hyper-parameter of a model. We can also implement epoch in our previous batch gradient descent optimizer but for simplicity we ignore that. Let’s implement it in our SGD optimizer:

[5.98165419 3.88881224]

Plot the losses for the first 500 updates.

Unlike the vanilla gradient descent, stochastic gradient descent doesn’t ensure the loss is always decreasing. But statistiaclly it should converge to the same solution as the vanilla approach.

1.2 Logistic Regression

A logistic regression models the outcome probablistically:

\[ \begin{equation} \label{eq:logit} P(Y = 1) = \frac{1}{1 + e^{-X\beta}}, \end{equation} \]

Here the sigmoid function \(s(t) = \frac{1}{1 + e^{-t}}\) is used to transform a real number into probability space \([0, 1]\).

We can interpret the model as a linear model in log-odds. Assuming Y is binary and take a value of 0 or 1, the odds of \(Y = 1\) is defined as \(\frac{P(Y = 1)}{P(Y = 0)} = \frac{P(Y = 1)}{1 - P(Y = 1)}\). We can re-arrange the logistic model equation:

\[ \begin{aligned} \ln \Bigg[ \frac{P(Y = 1)}{1 - P(Y = 1)} \Bigg] &= \ln \Bigg[ \frac{\frac{1}{1 + e^{-X\beta}}}{\frac{e^{-X\beta}}{1 + e^{-X\beta}}} \Bigg] \\ &= \ln(1 + e^{-X\beta}) - \ln e^{-X\beta}(1 + e^{-X\beta}) \\ &= \ln(1 + e^{-X\beta}) - \ln e^{-X\beta} - \ln(1 + e^{-X\beta}) \\ &= - \ln e^{-X\beta} \\ &= X\beta. \end{aligned} \]

That is, the model weights are linear in the log-odds of our target outcome.

When the probability is transformed into odds, the range is transformed from \([0,1]\) to \([0,\infty]\).

If we further transform odds to log-odds, the range is transformed from \([0,\infty]\) to \([-\infty,\infty]\).

Put it together is the effect of the sigmoid function:

1.2.1 Cross Entropy

How do we solve for the model weights \(\beta\) in equation \(\eqref{eq:logit}\)? In the linear regression model we solve for the weights by minimizing the mean squared error. In logistic regression model we need to define measurement for modeling error as well.

Put it differently, we’d like to calculate the distance between our predicted probability and the real event label distribution (called the empirical distribution). In information theory the cross entropy is used to measure the distance between two probability distribution.

The entropy of a discrete probability distribution is defined as:

\[ H(p) = - \sum_i p_i \log_2p_i, \]

where \(p_i\) is the probability for event \(i\). It measures the uncertainty of a stochastic event.

Take a coin flip event as example. We plot the entropy value at different value of coin bias (the probability of having a head instead of a tail.)

It is obvious that the entropy of this event is maximized when the probability of flipping a head is exactly 0.5. At this level the event has a highest level of uncertainty in a sense that it is the most difficult case to predict the outcome of a flip.

Now move on to cross entropy. Cross entropy between two discrete probability distribution \(p\) and \(q\) over the same support is defined as:

\[ H(p, q) = - \sum_i p_i \log_2q_i. \]

For our logistic regression model, distribution \(p\) is the empirical distribution (label distribution) and distribution \(q\) is our model predicted distribution. Denote \(q_i = P(y_i = 1)\) for the prediction of \(i\)-th example. The cross-entropy-loss of our model hence can be written as:

\[ \mbox{Cross-Entropy Loss} = - \frac{1}{N}\sum_i^N \bigg[ y_i\log_2q_i + (1 - y_i)\log_2(1 - q_i)\bigg], \]

where \(y_i \in \{0,1\}\) is the \(i\)-th binary training label and \(P(y_i = 1)\) the model prediction for the \(i\)-th example out of \(N\) total training examples. This is the mean value of cross entropy for each training example.

Since \(q_i = P(y_i = 1)\) is expressed by our model equation \(\eqref{eq:logit}\), now we can apply gradient descent to the loss function to find out the optimum model weights that minimize the cross-entropy loss.

1.2.2 Maximum Likelihood Estimator

Before we implement our optimizer for a logistic regression model, we demonstrate that minimizing cross entropy is indeed equivalent to maximizing data likelihood.

The data likelihood is just the product of all predicted probabilities for individual example, provided that they are independent.

\[ \mbox{likelihood} = \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i}. \]

The maximum likelihood estimator will try to maximize the log of the likelihood, which is:

\[ \begin{aligned} \mbox{log-lik} &= \log_2 \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i} \\ &= \sum_i^N \log_2 q_i^{y_i}(1 - q_i)^{1 - y_i} \\ &= \sum_i^N \bigg[ y_i\log_2 q_i + (1 - y_i)\log_2(1 - q_i) \bigg]. \end{aligned} \]

By taking the average as well, the negative log-likelihood is exactly the cross entropy.

1.2.3 Gradient Descent with Log Loss

Now let’s also implement a batch gradient descent optimizer for a logistic regression model. We will use the same toy example and create additional random binary labels for this exercise.

The gradient function must be derived with respect to model weights. To simplify the math we replace log of base 2 with natural log (with no impact on the optimization), and we take advantage of the fact that the derivative of the sigmoid function is:2

\[ \begin{aligned} \frac{ds(t)}{dt} &= \frac{d\frac{1}{1 + e^{-t}}}{dt} \\ &= -(1 + e^{-t})^{-2} \cdot (- e^{-t}) \\ &= \frac{e^{-t}}{(1 + e^{-t})^2} \\ &= \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}} \\ &= s(t) \cdot [1 - s(t)]. \end{aligned} \]

The gradient in our univariate model w.r.t. the bias term will be:3

\[ \begin{aligned} \frac{\partial\mbox{LogLoss}}{\partial\beta_0} &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{\partial\ln q_i}{\partial\beta_0} + (1 - y_i)\frac{\partial\ln(1 - q_i)}{\partial\beta_0}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial q_i}{\partial\beta_0} + (1 - y_i)\frac{1}{1 - q_i}\frac{\partial(1 - q_i)}{\partial\beta_0}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t} - (1 - y_i)\frac{1}{1 - q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t}\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}q_i(1 - q_i) - (1 - y_i)\frac{1}{1 - q_i}q_i(1 - q_i)\bigg] \\ &= - \frac{1}{N}\sum_i \bigg[ y_i(1 - q_i) - (1 - y_i)q_i \bigg] \\ &= - \frac{1}{N}\sum_i (y_i - q_i). \end{aligned} \]

Similary for the weight:

\[ \frac{\partial\mbox{LogLoss}}{\partial\beta_1} = - \frac{1}{N}\sum_i (y_i - q_i)x_i. \]

Now the Python code for batch gradient descent with log-loss:

[2.7018906  1.37265212]

Comparing to linear regression, a logistic regression model is harder to converge. Since our naive implementation does not do convergence diagnostics, let’s use R’s built-in glm function which use Newton’s method (a 2nd-order optimizer utilizing not only 1st-order but also 2nd-order derivatives) to check the estimation result of our toy example:

(Intercept)           x 
   7.279770    4.913451 

Now increase both the learning rate and training step of our naive gradient descent optimizer:

[7.19221871 4.82282142]

Seems better.

Or we can use the Python package statsmodels (which also uses a 2nd-order optimizer by default) to check our result:

Optimization terminated successfully.
         Current function value: 0.097274
         Iterations 10
<class 'statsmodels.iolib.summary.Summary'>
"""
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Tue, 25 Jun 2019   Pseudo R-squ.:                  0.6480
Time:                        21:59:24   Log-Likelihood:                -97.274
converged:                       True   LL-Null:                       -276.32
Covariance Type:            nonrobust   LLR p-value:                 7.328e-80
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.2798      0.721     10.100      0.000       5.867       8.692
x1             4.9135      0.534      9.195      0.000       3.866       5.961
==============================================================================

Possibly complete quasi-separation: A fraction 0.36 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
"""

Or we can use the popular machine learning library sklearn to do the same check:

array([[7.27975064, 4.91343756]])

/Users/kylechung/Library/Python/3.6/lib/python/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

2 Automatic Differentiation

In the previous section we implement a simple gradient descent optimizer by manually derive the functional form of gradient on our own. This could be troublesome if our model becomes more and more complicated, as in the case of a deep neural net.

Automatci differentiation is a programming technique to calculate the gradient of any given function. One of the most popular library for this purpose is TensorFlow.

Let’s use tensorflow to implement our simple gradient descent optimizer again. But this time we will NOT explicitly derive the gradient function. Instead, we will only specify the target function which is just the loss function of our model.

/usr/local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning:

Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

[[6.0113106]
 [3.9624662]]
[[7.1964693]
 [4.8522964]]

In the above coding example one should realize that we no longer need to hardcode the functional form of the gradients. Instead we just plug-in the loss function and let tensorflow to do the gradient calculation for us.

Of course in actual development we will use higher-level APIs to implement our model, where the entire optimization process is abstracted away from the application code.

3 Neural Networks

Both linear regression and logistic regression can be considered as simple additive model of the form:

\[ \hat{y} = \Phi\bigg(\sum_{i=1}^Pw_ix_i\bigg), \]

where \(P\) is the number of features used, \(x_i\) is the \(i\)-th feature, and \(\Phi(\cdot)\) is a function applied to the output. In linear regression \(\Phi(\cdot)\) is simply an identity function. In logistic regression \(\Phi(\cdot)\) is the standard sigmoid function.

Now consider there is a way to ensemble multiple such additive models together to generate a potentially better and more sophisticated model. We use the following diagram for illustration.

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

In the above diagram, \(Y_{11}\) is a single additive model

\[ Y_{11} = \Phi(W_{111}X_1 + W_{211}X_2 + W_{311}X_3). \]

Similarly \(Y_{12}\) is another such model (with the same input feature set but different model weights)

\[ Y_{12} = \Phi(W_{112}X_1 + W_{212}X_2 + W_{312}X_3). \]

Model \(Y_2\) is yet another additive model but takes the output of the above two models:

\[ Y_2 = \Phi(W_{1121}Y_{11} + W_{1221}Y_{12}). \]

The above setup is a simple architecture of a neural network model, with only one hidden layer of two neurons. (We also ignore the constant/bias term in each layer for simplicity.) A neuron is simply an additive model with a so-called activation function \(\Phi(\cdot)\) to transform the output from any real number into a scaled signal.

One now can easily realize that a logistic regression model could be viewed as a degenerated neural network model with single neuron and using sigmoid as the activation function. And a linear regression model is a degenerated neural network model with single neuron and without an activation function.

3.1 Activation Function

Why do we need the activation function? In the above neural network model in the absence of activation function the final output model \(Y_2\) will degenerate into a simple linear model. We can use simpler notations to demonstrate this:

\[ \begin{aligned} y_1 &= ax + b, \\ y_2 &= cx + d, \\ y_3 &= e + fy_1 + gy_2 \\ &= e + f(ax + b) + g(cx + d) \\ &= \underbrace{(e + fb + gd)}_\text{Bias} + \underbrace{(fa + gc)}_\text{Weight}x. \end{aligned} \]

Without activation function, no matter how many neurons or layers we design for our model, it eventually reduces to a simple linear model. With the activation function applied to each neuron, the model becomes non-linear and hence can handle much more complicated patterns hidden behind the data.

Some popular activation functions:

3.2 Backpropagation

To solve for model weights in a neural network, we use a technique called backpropagation which is essentially an iterative process of gradient descent.

To simplify notation we assume each neuron is simply a univariate model. Consider the following minimum architecture:

Mathematically:

\[ \begin{aligned} \hat{y} &= \Phi(b_1 + w_1z) \\ &= \Phi(b_1 + w_1\Phi(b_0 + w_0x)), \end{aligned} \]

where

\[ z = \Phi(b_0 + w_0x), \]

and

\[ \Phi(t) = \frac{1}{1 + e^{-t}}. \]

3.2.1 MSE Loss

Though it may not be very meaningful to use MSE loss when the output layer is applied with an activation function, we can still do it for educational purpose.

\[ \mbox{MSE-Loss} = \frac{1}{N}\sum_i^N (y_i - \hat{y}_i)^2. \]

Firstly we take the derivative w.r.t. the weight in the last layer:

\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_1} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_1} \\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \underbrace{ \frac{\partial t}{\partial w_1}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i. \end{aligned} \]

Similarly for the bias in the last layer:

\[ \frac{\partial \mbox{MSE-Loss}}{\partial b_1} = - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)). \]

Now move on to the bias and weight in the first layer:

\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_0} \\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \underbrace{ \frac{\partial t}{\partial w_0}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\ &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \underbrace{ \Phi(k)(1 - \Phi(k)) }_{k = b_0 +w_0x} \cdot w_1 \cdot x_i, \\ \frac{\partial \mbox{MSE-Loss}}{\partial b_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \Phi(k)(1 - \Phi(k)). \end{aligned} \]

One can clearly see there is a linkage between the derivative of the weights in consecutive layers:

\[ \begin{aligned} \frac{\partial \mbox{MSE-Loss}}{\partial w_1} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i, \\ \frac{\partial \mbox{MSE-Loss}}{\partial w_0} &= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \underbrace{\Phi(t)(1 - \Phi(t))}_{\hat{y}_i(1 - \hat{y}_i)} \cdot \underbrace{\Phi(k)(1 - \Phi(k)) \cdot w_1 \cdot x_i}_{\frac{\partial w_1z}{\partial w_0} = \frac{\partial w_1\Phi(b0 + w_0x)}{\partial w_0}}. \end{aligned} \]

3.2.2 Cross-Entropy Loss

Similarly we can derive the gradients for cross-entropy loss.

\[ \begin{aligned} \mbox{LogLoss} &= - \frac{1}{N}\sum_i^N \bigg[ y_i\ln\hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\bigg], \\ \frac{\partial \mbox{LogLoss}}{\partial w_1} &= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)z_i, \\ \frac{\partial \mbox{LogLoss}}{\partial w_0} &= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)\Phi(k)(1 - \Phi(k))w_1x_i.\\ \end{aligned} \]

(Skip the bias terms to save space.)

Now let’s implement the simple neural network model in Python:

Here we will use the hello-world example of artificial neural network: The XOR problem. The XOR logical outcome, besides extremely simple, is not linearly separable. So it serves as a good example of showcasing neural networks’ non-linearity.

The input data is simply combination of two binary switches. (For completeness we also include a constant term which always evaluate to 1 as the first feature.) The output data is the XOR result.

[[1 0 0]
 [1 1 0]
 [1 0 1]
 [1 1 1]]
[[0]
 [1]
 [1]
 [0]]

Now let’s see if our simple neural network model can learn the XOR pattern.

'0.016509058781027677,0.9728542987218917,0.9784540441634975,0.027059912943542017'
[[-2.02258664  1.42223805 -1.71522187 -4.26403023]
 [ 6.22204576  5.73421659 -0.35405646  2.29735218]
 [ 5.86924511 -3.55734133  4.32911695  3.3323955 ]]
[[10.56290787]
 [-5.62409836]
 [-4.67738934]
 [-5.56713542]]
'0.0001870860964935529,0.9994879226437576,0.9993958543889606,0.0006300370523400314'

For this simple example, both MSE and cross-entropy loss can work fine to figure out the XOR pattern. (Strictly speaking cross-entropy loss performs better.)

3.3 Regularization

3.3.1 L2-Norm

3.3.2 L1-Norm

3.3.3 Dropout


  1. One can try change the sample size arbitrarily to see the behavior of the estimator.

  2. Here we use the common derivatives: \(\frac{df(x)^n}{dx} = nf(x)^{n-1}f'(x)\) and \(\frac{de^{f(x)}}{dx} = f'(x)e^{f(x)}.\)

  3. Here we use the common derivative \(\frac{d\ln f(x)}{dx} = \frac{f'(x)}{f(x)}\) and the chain rule to handle the term \(\frac{\partial\ln q_i}{\partial\beta_0}\) where \(q_i\) is the sigmoid computed for the \(i\)-th example.

---
title: "Neural Networks Fundamentals"
subtitle: "with Examples using Python"
author:
- name: Kyle Chung
  affiliation:
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_notebook: 
    number_sections: yes
    theme: flatly
    toc: yes
    highlight: pygments
---
<!-- Embed plotly javascript library.
  The preview of plotly output in RStudio won't work
  since we don't include the plotly.js in each individual plot.
--> 
<script src="js/plotly-latest.min.js"></script>

<!--For equation reference.--> 
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>

```{r setup, include=FALSE}
library(reticulate)
use_python(Sys.getenv("PYTHON_PATH"), required=TRUE)
```

# Prerequisites

Before we even start,
it is of crucial importance to first understand deeply linear and logistic regression.
This is because (as we will see in later discussion) they are the very basic components in a general neural network model.

## Linear Regression

A linear model can be written in a matrix form

$$
Y = X\beta + \epsilon,
$$
where $Y$ is the output or label vector of length $N$ (number of observations),
$X$ is the input feature matrix (referred to as the *design matrix* in statistics) with dimension $N$ by $P$ (number of features),
$\beta$ is the model weights/coefficients (a column vector of length $P$) for which we'd like to solve,
$\epsilon$ is the model residual or error vector.

### Ordinary Least Squares

The classical way to solve for $\beta$ is [ordinary least squares](https://en.wikipedia.org/wiki/Ordinary_least_squares).
The idea of OLS is find out the weights that minimize the mean of squared model errors:

$$
\begin{aligned}
\mbox{mse (loss)}
&= \frac{1}{N}\sum_i\epsilon^2_i \\
&= \frac{1}{N}\sum_i(y_i - \beta x_i)^2,
\end{aligned}
$$

where $y_i$ and $x_i$ is the $i$-th observation.
The first-order condition (requiring that the first-order derivative w.r.t. weights are zero) gives us the OLS solution for model weights $\beta$ analytically (in matrix notation):

$$
\begin{equation} \label{eq:ols}
\hat{\beta} = (X'X)^{-1}X'Y.
\end{equation}
$$

```{python linear_reg_ols}
import numpy as np
from numpy.linalg import inv

def ols(X, y):
    return inv(X.T.dot(X)).dot(X.T).dot(y)
```

Let's consider a toy model with only one non-constant feature:

$$
y_i = 6 + 4x_i + \epsilon_i.
$$

In this model the outcome $y$ is determined by a bias term plus a single variable $x$,
with an independently distributed noise term $\epsilon \sim \mbox{Normal}(0, 1)$.

Create some random data generated from this model:

```{python linear_reg_toy_example}
# Create toy example.
np.random.seed(777)

N = 1000
X = np.stack([np.ones(N), np.random.normal(size=N)], axis=1)
beta = np.array([6, 4], dtype=np.float32)  # True model weights.
e = np.random.normal(size=N)
y = X.dot(beta) + e

print(X[:10])
```

```{python linear_reg_plot}
from plotly.offline import plot
import plotly.graph_objs as go

def plot_offline(data, layout, ofile):
  p = plot({"data": data, "layout": layout},
           filename=ofile, auto_open=False, include_plotlyjs=False)
  return p

ofile = "plots/toy_reg.html"
p = plot_offline(
  ofile=ofile,
  data=[go.Scatter(x=X[:,1], y=y, mode="markers")],
  layout=go.Layout(title="Data Generated from Toy Model",
                   xaxis=dict(title="x"),
                   yaxis=dict(title="y")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

Without consideration of the noise term,
the OLS estimator will solve precisely for the true model weights:

```{python linear_reg_ols_deterministic}
print(ols(X, y - e))
```

Of course in the real world the noise term cannot be determined and usually the feature alone cannot explain entirely the variation in the outcome.
This means that what we actually estimate will be the expected value of model weights:

```{python linear_reg_ols_with_epsilon}
print(ols(X, y))
```

By the [Law of Large Number](https://en.wikipedia.org/wiki/Law_of_large_numbers) and [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem),
the OLS estimator will converge in probability to the true model weights and distributed asymptotically Normal in large sample.^[One can try change the sample size arbitrarily to see the behavior of the estimator.]

The issue of the above approach is that equation $\eqref{eq:ols}$ is not numerically stable when it comes to large-scale application where we may have lots of observations and lots of features.
One very useful solution to solve the estimator numerically in large-scale application is the *gradient descent* approach.

### Gradient Descent with Mean Squared Error

Instead of solving the first-order condition analytically,
we can do it numerically.
Gradient descent is a 1st-order optimization technique to find local optimum of a given function.

In the model training exercise our target function is the loss so the optimization problem is:

$$
\operatorname*{argmin}_\beta \mbox{Loss} = \frac{1}{N}\sum_i(y_i - \beta x_i)^2.
$$

That is,
we'd like to figure out model weights that minimize the loss which is defined by the mean squared errors when the model is a regression model.

The idea of gradient descent is to

1. Derive the functional form of the gradient of loss w.r.t. to all weights
2. Initialize all model weights randomly
3. Calculate the gradient value using the actual data and the current value of weights
4. Update the weights by (partially) the amount of gradient just calculated
5. Repeat 3 and 4 until the resulting gradients become small enough

Let's use the toy example to actually implement a gradent descent optimizer from scratch.
First we re-write the loss function explicitly with our setup of one coefficient with a constant ($\beta = [\beta_0, \beta_1]$):

$$
\mbox{Loss} = \frac{1}{N}\sum_i\big[y_i-(\beta_0 + \beta_1x_i)\big]^2.
$$

Now the gradient (or equivalently the 1st-order derivative) w.r.t. to weights will be:

$$
\begin{aligned}
\frac{\partial\mbox{Loss}}{\partial\beta_0} 
&= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big], \\
\frac{\partial\mbox{Loss}}{\partial\beta_1} 
&= - \frac{2}{N}\sum_i \big[ y_i - (\beta_0 + \beta_1x_i) \big]x_i.
\end{aligned}
$$

The corresponding python function can be coded as:

```{python linear_reg_grad_func}
def grad_func(X, y, beta):
  """Calculate vectorized gradients."""
  return -2 * ((y - X.dot(beta)).dot(X)) / X.shape[0]
```

If we set the above equations to zero we can solve for $\beta_0$ and $\beta_1$ analytically and the solution will be exactly just equation $\eqref{eq:ols}$.
But as we just pointed out it suffers from numerical stability issue.

The minimum implementation of our gradient descent optimizer is just a few lines:

```{python linear_reg_gd}
def gd_optimize(X, y, lr=.01, n_step=100):
  b = np.random.normal(size=X.shape[1])
  for step in range(n_step):
    b -= lr*grad_func(X, y, b)
  return b

print(gd_optimize(X, y, n_step=3000))
```

As we can see the result is very closed to our analytical solution.

#### On Learning Rate {-}

Learning rate is a hyper-parameter for gradient descent optimizer.
The gradient update to our model weights is scaled down by the learning rate to make sure convergence.
Too large the learning rate will explode the gradient.
Too small the learning rate will slow down the convergence and sometimes result in the optimizer trapped at local sub-optimum.

Let's re-write our gradient descent optimizer to also track the loss from each training step.
And we use the same initialization for a fair comparison.

```{python linear_reg_gd_with_loss}
def loss(X, y, beta):
  return ((X.dot(beta) - y)**2).mean()

def gd_optimize(X, y, lr=.01, n_step=100):
  b = np.array([0.0, 0.0])
  l = [loss(X, y, b)]
  for step in range(n_step):
    b -= lr*grad_func(X, y, b)
    l.append(loss(X, y, b))
  return b, l
```

Now we run the optimization with a variety of different learning rates.
For illustration purpose we will only run a few steps.

```{python linear_reg_diff_lr}
betas = {}
losses = {}
for lr in [.001, .01, .05, .1, 1]:
  betas[lr], losses[lr] = gd_optimize(X, y, lr=lr, n_step=15)

for r, b in betas.items():
  print("Learning Rate {:5} | Estimate: {}".format(r, b))
```

The result suggests that a learning rate of 1 is too large for our problem.
The gradient explodes which make our solution diverge.
And a lower learning rate in general converges slower to the optimum.
Number of examples used to calculate the gradient also will affect the convergence behavior.
In general if the sample size is too small a smaller learning rate should be used to avoid gradient explosion.

This can be more clearly seen if we plot the trace of our training losses:

```{python linear_reg_diff_lr_plot}
ofile = "plots/toy_reg_loss_compare.html"
pdata = [go.Scatter(x=np.arange(len(losses[lr])), y=losses[lr], name=lr) 
         for lr in losses.keys()]
p = plot_offline(
  ofile=ofile,
  data=pdata,
  layout=go.Layout(
    title="Trace of Training Loss with Various Learning Rates",
    xaxis=dict(title="Step"),
    yaxis=dict(title="Loss")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

If the loss doesn't decrease over training iteration,
it is a signal that something is wrong with our model.

Unlike our toy implementation,
in modern implementation of any numerical optimizer there will be a lot of techniques to do the best to avoid convergence failure.
But it is the model developer's responsibility to diagnose the training behavior before anything is delivered to the stakeholder.
Checking the dynamics of loss is usually the first and quick step to examine whether the training task is functioning as expected.

#### Batch Gradient Descent {-}

The vanilla gradient descent optimizer we just implemented has one issue.
Since for each update it needs to traverse over the entire dataset,
it becomes too slow when it comes to large dataset.

Batch gradient descent is too overcome this issue.
Instead of calculate the gradient using the entire dataset,
we can use only a random subset of it.
It will be less precise but statistically the result will be consistent.

Let's implement a batch gradient descent optimizer:

```{python linear_reg_gd_batch}
def gd_batch_optimize(X, y, lr=.01, n_step=100, batch_size=64):
  b = np.random.normal(size=X.shape[1])
  l = [loss(X, y, b)]
  for step in range(n_step):
    sid = np.random.randint(X.shape[0], size=batch_size)
    Xb = X[sid,:]
    yb = y[sid]
    b -= lr*grad_func(Xb, yb, b)
    l.append(loss(Xb, yb, b))
  return b, l

batch_beta, batch_loss = gd_batch_optimize(X, y, n_step=3000, batch_size=64)
print(batch_beta)
```

Batch optimizer is currently the best practice of training neural networks in large scale application.
The batch size depends on the actual application but usually ranges from 8 to 1024.

#### Stocahstic Gradient Descent {-}

If we set the batch size as exactly 1,
i.e.,
to calculate the gradient update only use one data observation at a time,
this is referred to as stochastic gradient descent.

Here we introduce the term *epoch*:
One epoch is for the optimizer to traverse the entire dataset once (no matter how many examples are used to calculate the gradient in each update).
Number of epochs can be considered as another hyper-parameter of a model.
We can also implement epoch in our previous batch gradient descent optimizer but for simplicity we ignore that.
Let's implement it in our SGD optimizer:

```{python linear_reg_sgd}
def sgd_optimize(X, y, lr=.01, n_epoch=100):
  b = np.random.normal(size=X.shape[1])
  l = [loss(X, y, b)]
  for epoch in range(n_epoch):
    sid = np.random.permutation(X.shape[0])
    for i in sid:
      b -= lr*grad_func(X[None,i,:], y[i], b)
      l.append(loss(X[None,i,:], y[i], b))
  return b, l

sgd_beta, sgd_loss = sgd_optimize(X, y, lr=.01, n_epoch=10)
print(sgd_beta)
```

Plot the losses for the first 500 updates.

```{python linear_reg_sgd_plot}
ofile = "plots/toy_reg_sgd_loss.html"
p = plot_offline(
  ofile=ofile,
  data=[go.Scatter(x=np.arange(len(sgd_loss[:500])), y=sgd_loss[:500])],
  layout=go.Layout(
    title="Trace of SGD Training Loss",
    xaxis=dict(title="Step"),
    yaxis=dict(title="Loss")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

Unlike the vanilla gradient descent,
stochastic gradient descent doesn't ensure the loss is always decreasing.
But statistiaclly it should converge to the same solution as the vanilla approach.

## Logistic Regression

A logistic regression models the outcome probablistically:

$$
\begin{equation} \label{eq:logit}
P(Y = 1) = \frac{1}{1 + e^{-X\beta}},
\end{equation}
$$

Here the sigmoid function $s(t) = \frac{1}{1 + e^{-t}}$ is used to transform a real number into probability space $[0, 1]$.

We can interpret the model as a linear model in log-odds.
Assuming Y is binary and take a value of 0 or 1,
the odds of $Y = 1$ is defined as $\frac{P(Y = 1)}{P(Y = 0)} = \frac{P(Y = 1)}{1 - P(Y = 1)}$.
We can re-arrange the logistic model equation:

$$
\begin{aligned}
\ln \Bigg[ \frac{P(Y = 1)}{1 - P(Y = 1)} \Bigg] 
&= \ln \Bigg[ \frac{\frac{1}{1 + e^{-X\beta}}}{\frac{e^{-X\beta}}{1 + e^{-X\beta}}} \Bigg] \\
&= \ln(1 + e^{-X\beta}) - \ln e^{-X\beta}(1 + e^{-X\beta}) \\
&= \ln(1 + e^{-X\beta}) - \ln e^{-X\beta} - \ln(1 + e^{-X\beta}) \\
&= - \ln e^{-X\beta} \\
&= X\beta.
\end{aligned}
$$

That is,
the model weights are linear in the log-odds of our target outcome.

When the probability is transformed into odds,
the range is transformed from $[0,1]$ to $[0,\infty]$.

```{python logit_prob_odds_plot}
# We plot only up to .95 since the value of odds will grow unboundedly.
prob = np.linspace(.001, .95, num=100)
odds = prob / (1 - prob)
log_odds = np.log(odds)

ofile = "plots/prob_odds.html"
p = plot_offline(
  ofile=ofile,
  data=[go.Scatter(x=prob, y=odds)],
  layout=go.Layout(
    title="Event Probabilities to Odds",
    xaxis=dict(title="Probability"),
    yaxis=dict(title="Odds")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

If we further transform odds to log-odds,
the range is transformed from $[0,\infty]$ to $[-\infty,\infty]$.

```{python logit_odds_logodds_plot}
ofile = "plots/prob_odds.html"
p = plot_offline(
  ofile=ofile,
  data=[go.Scatter(x=odds, y=log_odds)],
  layout=go.Layout(
    title="Event Odds to Log-Odds",
    xaxis=dict(title="Odds"),
    yaxis=dict(title="Log-Odds")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

Put it together is the effect of the sigmoid function:

```{python logit_sigmoid_plot}
# Note that instead of coding s = 1 / (1 + np.exp(-t)),
# which is numerically unstable,
# we should use a math trick to make it stable.
def sigmoid(t):
  """Numerically stable sigmoid."""
  return np.exp(-np.logaddexp(0, -t))

t = np.linspace(-10, 10, num=100)
ofile = "plots/sigmoid.html"
p = plot_offline(
  ofile=ofile,
  data=[go.Scatter(x=t, y=sigmoid(t))],
  layout=go.Layout(
    title="A Sigmoid Function",
    xaxis=dict(title="Raw Value"),
    yaxis=dict(title="Probability")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

### Cross Entropy

How do we solve for the model weights $\beta$ in equation $\eqref{eq:logit}$?
In the linear regression model we solve for the weights by minimizing the mean squared error.
In logistic regression model we need to define measurement for modeling error as well.

Put it differently,
we'd like to calculate the distance between our predicted probability and the real event label distribution (called the *empirical distribution*).
In information theory the cross entropy is used to measure the distance between two probability distribution.

The [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) of a discrete probability distribution is defined as:

$$
H(p) = - \sum_i p_i \log_2p_i,
$$

where $p_i$ is the probability for event $i$.
It measures the uncertainty of a stochastic event.

Take a coin flip event as example.
We plot the entropy value at different value of coin bias (the probability of having a head instead of a tail.)

```{python logit_coin_flip_entropy_plot}
def entropy_binary(p):
  return -p*np.log2(p) - (1 - p)*np.log2(1 - p)
  
prob = np.linspace(.01, .99, num=100)

ofile = "plots/entropy.html"
p = plot_offline(
  ofile=ofile,
  data=[go.Scatter(x=prob, y=entropy_binary(prob))],
  layout=go.Layout(
    title="Entropy of a Binary Event (Coin Flip)",
    xaxis=dict(title="Probability of a Head"),
    yaxis=dict(title="Entropy")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

It is obvious that the entropy of this event is maximized when the probability of flipping a head is exactly 0.5.
At this level the event has a highest level of uncertainty in a sense that it is the most difficult case to predict the outcome of a flip.

Now move on to cross entropy.
Cross entropy between two discrete probability distribution $p$ and $q$ over the same support is defined as:

$$
H(p, q) = - \sum_i p_i \log_2q_i.
$$

For our logistic regression model,
distribution $p$ is the empirical distribution (label distribution) and distribution $q$ is our model predicted distribution.
Denote $q_i = P(y_i = 1)$ for the prediction of $i$-th example.
The cross-entropy-loss of our model hence can be written as:

$$
\mbox{Cross-Entropy Loss} = - \frac{1}{N}\sum_i^N \bigg[ y_i\log_2q_i + (1 - y_i)\log_2(1 - q_i)\bigg],
$$

where $y_i \in \{0,1\}$ is the $i$-th binary training label and $P(y_i = 1)$ the model prediction for the $i$-th example out of $N$ total training examples.
This is the mean value of cross entropy for each training example.

Since $q_i = P(y_i = 1)$ is expressed by our model equation $\eqref{eq:logit}$,
now we can apply gradient descent to the loss function to find out the optimum model weights that minimize the cross-entropy loss.

### Maximum Likelihood Estimator

Before we implement our optimizer for a logistic regression model,
we demonstrate that minimizing cross entropy is indeed equivalent to maximizing data likelihood.

The data likelihood is just the product of all predicted probabilities for individual example,
provided that they are independent.

$$
\mbox{likelihood} = \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i}.
$$

The maximum likelihood estimator will try to maximize the log of the likelihood,
which is:

$$
\begin{aligned}
\mbox{log-lik} 
&= \log_2 \prod_i^N q_i^{y_i}(1 - q_i)^{1 - y_i} \\
&= \sum_i^N \log_2 q_i^{y_i}(1 - q_i)^{1 - y_i} \\
&= \sum_i^N \bigg[ y_i\log_2 q_i + (1 - y_i)\log_2(1 - q_i) \bigg].
\end{aligned}
$$

By taking the average as well,
the negative log-likelihood is exactly the cross entropy.

### Gradient Descent with Log Loss

Now let's also implement a batch gradient descent optimizer for a logistic regression model.
We will use the same toy example and create additional random binary labels for this exercise.

```{python logit_fake_data}
y2 = np.array([np.random.binomial(1, v) for v in sigmoid(X.dot(beta))])
```

The gradient function must be derived with respect to model weights.
To simplify the math we replace log of base 2 with natural log (with no impact on the optimization),
and we take advantage of the fact that the derivative of the sigmoid function is:^[Here we use the common derivatives: $\frac{df(x)^n}{dx} = nf(x)^{n-1}f'(x)$ and $\frac{de^{f(x)}}{dx} = f'(x)e^{f(x)}.$]

$$
\begin{aligned}
\frac{ds(t)}{dt}
&= \frac{d\frac{1}{1 + e^{-t}}}{dt} \\
&= -(1 + e^{-t})^{-2} \cdot (- e^{-t}) \\
&= \frac{e^{-t}}{(1 + e^{-t})^2} \\
&= \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}} \\
&= s(t) \cdot [1 - s(t)].
\end{aligned}
$$

The gradient in our univariate model w.r.t. the bias term will be:^[Here we use the common derivative $\frac{d\ln f(x)}{dx} = \frac{f'(x)}{f(x)}$ and the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) to handle the term $\frac{\partial\ln q_i}{\partial\beta_0}$ where $q_i$ is the sigmoid computed for the $i$-th example.]

$$
\begin{aligned}
\frac{\partial\mbox{LogLoss}}{\partial\beta_0} 
&= - \frac{1}{N}\sum_i \bigg[ y_i\frac{\partial\ln q_i}{\partial\beta_0} + 
  (1 - y_i)\frac{\partial\ln(1 - q_i)}{\partial\beta_0}\bigg] \\
&= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial q_i}{\partial\beta_0} + 
  (1 - y_i)\frac{1}{1 - q_i}\frac{\partial(1 - q_i)}{\partial\beta_0}\bigg] \\
&= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t} - 
  (1 - y_i)\frac{1}{1 - q_i}\frac{\partial (\beta_0 + \beta_1x_i)}{\partial\beta_0}\frac{\partial q_i(t)}{\partial t}\bigg] \\
&= - \frac{1}{N}\sum_i \bigg[ y_i\frac{1}{q_i}q_i(1 - q_i) - 
  (1 - y_i)\frac{1}{1 - q_i}q_i(1 - q_i)\bigg] \\
&= - \frac{1}{N}\sum_i \bigg[ y_i(1 - q_i) - (1 - y_i)q_i \bigg] \\
&= - \frac{1}{N}\sum_i (y_i - q_i).
\end{aligned}
$$

Similary for the weight:

$$
\frac{\partial\mbox{LogLoss}}{\partial\beta_1}
= - \frac{1}{N}\sum_i (y_i - q_i)x_i.
$$

Now the Python code for batch gradient descent with log-loss:

```{python logit_gd}
def logloss(X, y, b):
  logloss_pos = y * np.log(sigmoid(X.dot(b)))
  logloss_neg = (1 - y) * np.log(1 - sigmoid(X.dot(b)))
  return - (logloss_pos + logloss_neg).mean()

def logit_grad_func(X, y, b):
  return - (y - sigmoid(X.dot(b))).dot(X) / X.shape[0]

def cross_entropy_gd_optimize(X, y, lr=.01, n_step=100, batch_size=64):
  b = np.random.normal(size=X.shape[1])
  l = [logloss(X, y, b)]
  for step in range(n_step):
    sid = np.random.randint(X.shape[0], size=batch_size)
    Xb = X[sid,:]
    yb = y[sid]
    b -= lr*logit_grad_func(Xb, yb, b)
    l.append(logloss(Xb, yb, b))
  return b, l

log_beta, log_loss = cross_entropy_gd_optimize(X, y2, lr=.01, n_step=3000)
print(log_beta)
```

```{python logit_loss_plot}
ofile = "plots/toy_log_loss.html"
pdata = [go.Scatter(x=np.arange(len(log_loss)), y=log_loss)]
p = plot_offline(
  ofile=ofile,
  data=pdata,
  layout=go.Layout(
    title="Trace of Training Log-Loss",
    xaxis=dict(title="Step"),
    yaxis=dict(title="Loss")))
```

```{r, echo=FALSE}
htmltools::includeHTML(py$ofile)
```

Comparing to linear regression,
a logistic regression model is harder to converge.
Since our naive implementation does not do convergence diagnostics,
let's use R's built-in `glm` function which use Newton's method (a 2nd-order optimizer utilizing not only 1st-order but also 2nd-order derivatives) to check the estimation result of our toy example:

```{r logit_check_solution}
# This is R code.
coef(glm(y ~ x, data=data.frame(y=py$y2, x=py$X[,2]), family=binomial))
```

Now increase both the learning rate and training step of our naive gradient descent optimizer:

```{python}
print(cross_entropy_gd_optimize(X, y2, lr=.5, n_step=5000)[0])
```

Seems better.

Or we can use the Python package `statsmodels` (which also uses a 2nd-order optimizer by default) to check our result: 

```{python}
import statsmodels.api as sm
sm.Logit(y2, X).fit().summary()
```

Or we can use the popular machine learning library `sklearn` to do the same check:

```{python}
from sklearn.linear_model import LogisticRegression
# We need to set C to an arbitrarily large number to suppress regularization
# since our naive approach doesn't implement any regularization.
LogisticRegression(C=1e16, fit_intercept=False).fit(X, y2).coef_
```

# Automatic Differentiation

In the previous section we implement a simple gradient descent optimizer by manually derive the functional form of gradient on our own.
This could be troublesome if our model becomes more and more complicated,
as in the case of a deep neural net.

Automatci differentiation is a programming technique to calculate the gradient of any given function.
One of the most popular library for this purpose is [TensorFlow](https://github.com/tensorflow/tensorflow).

Let's use `tensorflow` to implement our simple gradient descent optimizer again.
But this time we will NOT explicitly derive the gradient function.
Instead,
we will only specify the target function which is just the loss function of our model.

```{python tf_ad}
import tensorflow as tf
tf.enable_eager_execution()

# Use tensor to represent our data.
# Note that we need to be very specific about dtype/shape of our tensors.
X_tf = tf.convert_to_tensor(X, dtype=tf.float32)
beta_tf = tf.reshape(tf.convert_to_tensor(beta, dtype=tf.float32), (2,1))
e_tf = tf.reshape(tf.convert_to_tensor(e, dtype=tf.float32), (N, 1))
y_tf = tf.matmul(X_tf, beta_tf) + e_tf
y2_tf = tf.reshape(tf.convert_to_tensor(y2, dtype=tf.float32), (N, 1))

def loss_mse_tf(X, y, beta):
  y_hat = tf.matmul(X, beta)
  return tf.reduce_mean(input_tensor = (y_hat - y)**2)

def logloss_tf_bad(X, y, beta):
  # This can suffer from numerical stability issue.
  logloss_pos = y * tf.log(tf.sigmoid(tf.matmul(X, beta)))
  logloss_neg = (1 - y) * tf.log(1 - tf.sigmoid(tf.matmul(X, beta)))
  return - tf.reduce_mean(logloss_pos + logloss_neg)

def logloss_tf(X, y, beta):
  # tf.sigmoid is not numerically stable.
  # But there is a convenient function to do the cross entropy calculation stably.
  s = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=tf.matmul(X, beta))
  return tf.reduce_mean(s)

def gd_optimize_tf(X, y, loss_func, lr=.01, n_step=100):
  grad_func_tf = tf.contrib.eager.gradients_function(loss_func, params=[2])
  beta = tf.random_normal((2, 1))
  for step in range(n_step):
    grad = grad_func_tf(X, y, beta)[0]
    beta -= lr * grad
  return beta.numpy()

# For the MSE loss problem.
print(gd_optimize_tf(X_tf, y_tf, loss_func=loss_mse_tf, n_step=3000))

# For the cross entropy loss problem.
print(gd_optimize_tf(X_tf, y2_tf, loss_func=logloss_tf, lr=.5, n_step=5000))
```

In the above coding example one should realize that we no longer need to hardcode the functional form of the gradients.
Instead we just plug-in the loss function and let `tensorflow` to do the gradient calculation for us.

Of course in actual development we will use higher-level APIs to implement our model,
where the entire optimization process is abstracted away from the application code.

# Neural Networks

Both linear regression and logistic regression can be considered as simple additive model of the form:

$$
\hat{y} = \Phi\bigg(\sum_{i=1}^Pw_ix_i\bigg),
$$

where $P$ is the number of features used,
$x_i$ is the $i$-th feature,
and $\Phi(\cdot)$ is a function applied to the output.
In linear regression $\Phi(\cdot)$ is simply an identity function.
In logistic regression $\Phi(\cdot)$ is the standard sigmoid function.

Now consider there is a way to ensemble multiple such additive models together to generate a potentially better and more sophisticated model.
We use the following diagram for illustration.

```{r nn_diagram, echo=FALSE}
# Draw a simple neural nets.
DiagrammeR::grViz("
digraph subscript {

  graph [layout = dot rankdir = LR ordering = in ranksep=2 splines=true]

  node [shape = circle]

  subgraph cluster_input_layer {
    label = 'Input Layer'
    x1 [label = 'x@_{1}']
    x2 [label = 'x@_{2}']
    x3 [label = 'x@_{3}']
  }

  subgraph cluster_hidden_layer {
    label = 'Hidden Layer'
    n11 [label = 'y@_{11}' color = blue fontcolor = blue]
    n12 [label = 'y@_{12}' color = darkgreen fontcolor = darkgreen]
  }

  subgraph cluster_output_layer {
    label = 'Output Layer'
    n21 [label = 'y@_{2}']
  }

  edge [arrowsize = .25]

  x1 -> n11 [label = 'w@_{111}' color = blue fontcolor = blue]
  x1 -> n12 [label = 'w@_{112}' color = darkgreen fontcolor = darkgreen]

  x2 -> n11 [label = 'w@_{211}' color = blue fontcolor = blue]
  x2 -> n12 [label = 'w@_{212}' color = darkgreen fontcolor = darkgreen]

  x3 -> n11 [label = 'w@_{311}' color = blue fontcolor = blue]
  x3 -> n12 [label = 'w@_{312}' color = darkgreen fontcolor = darkgreen]

  n11 -> n21 [label = 'w@_{1121}']
  n12 -> n21 [label = 'w@_{1221}']

}")
```

In the above diagram,
$Y_{11}$ is a single additive model

$$
Y_{11} = \Phi(W_{111}X_1 + W_{211}X_2 + W_{311}X_3).
$$

Similarly $Y_{12}$ is another such model (with the same input feature set but different model weights)

$$
Y_{12} = \Phi(W_{112}X_1 + W_{212}X_2 + W_{312}X_3).
$$

Model $Y_2$ is yet another additive model but takes the output of the above two models:

$$
Y_2 = \Phi(W_{1121}Y_{11} + W_{1221}Y_{12}).
$$

The above setup is a simple architecture of a neural network model,
with only one hidden layer of two *neurons*.
(We also ignore the constant/bias term in each layer for simplicity.)
A neuron is simply an additive model with a so-called activation function $\Phi(\cdot)$ to transform the output from any real number into a scaled signal.

One now can easily realize that a logistic regression model could be viewed as a degenerated neural network model with single neuron and using sigmoid as the activation function.
And a linear regression model is a degenerated neural network model with single neuron and without an activation function.

## Activation Function

Why do we need the activation function?
In the above neural network model in the absence of activation function the final output model $Y_2$ will degenerate into a simple linear model.
We can use simpler notations to demonstrate this:

$$
\begin{aligned}
y_1 &= ax + b, \\
y_2 &= cx + d, \\
y_3 &= e + fy_1 + gy_2 \\
&= e + f(ax + b) + g(cx + d) \\
&= \underbrace{(e + fb + gd)}_\text{Bias} + \underbrace{(fa + gc)}_\text{Weight}x.
\end{aligned}
$$

Without activation function,
no matter how many neurons or layers we design for our model,
it eventually reduces to a simple linear model.
With the activation function applied to each neuron,
the model becomes *non-linear* and hence can handle much more complicated patterns hidden behind the data.

Some popular activation functions:

+ [Sigmoid (Logistic)](https://en.wikipedia.org/wiki/Logistic_function)
+ [Tanh (Hyperbolic Tangent)](https://en.wikipedia.org/wiki/Hyperbolic_function#Hyperbolic_tangent)
+ [ReLU (Rectified Linear Unit)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))
+ Swish: $f(x) = x \cdot \mbox{sigmoid}(x)$, proposed by [Google Brain](https://ai.google/research/teams/brain/)

## Backpropagation

To solve for model weights in a neural network,
we use a technique called backpropagation which is essentially an iterative process of gradient descent.

To simplify notation we assume each neuron is simply a univariate model.
Consider the following minimum architecture:

```{r simplified_nn_diagram, echo=FALSE}
DiagrammeR::grViz("
digraph subscript {

  graph [layout = dot rankdir = LR ordering = in ranksep=2]

  node [shape = circle]

  x0 [label = '1']
  x1 [label = 'x']
  z0 [label = '1']
  z1 [label = 'z']
  y  [label = 'y']

  edge [arrowsize = .25]

  x0 -> z1 [label = 'b@_{0}']
  x1 -> z1 [label = 'w@_{0}']
  z0 -> y  [label = 'b@_{1}']
  z1 -> y  [label = 'w@_{1}']

}")
```

Mathematically:

$$
\begin{aligned}
\hat{y} 
&= \Phi(b_1 + w_1z) \\
&= \Phi(b_1 + w_1\Phi(b_0 + w_0x)),
\end{aligned}
$$

where

$$
z = \Phi(b_0 + w_0x),
$$

and

$$
\Phi(t) = \frac{1}{1 + e^{-t}}.
$$

### MSE Loss

Though it may not be very meaningful to use MSE loss when the output layer is applied with an activation function,
we can still do it for educational purpose.

$$
\mbox{MSE-Loss} = \frac{1}{N}\sum_i^N  (y_i - \hat{y}_i)^2.
$$

Firstly we take the derivative w.r.t. the weight in the last layer:

$$
\begin{aligned}
\frac{\partial \mbox{MSE-Loss}}{\partial w_1}
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_1} \\
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)
\underbrace{  \frac{\partial t}{\partial w_1}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i.
\end{aligned}
$$

Similarly for the bias in the last layer:

$$
\frac{\partial \mbox{MSE-Loss}}{\partial b_1} = - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)).
$$

Now move on to the bias and weight in the first layer:

$$
\begin{aligned}
\frac{\partial \mbox{MSE-Loss}}{\partial w_0}
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)\frac{\partial \hat{y}_i}{\partial w_0} \\
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i)
\underbrace{  \frac{\partial t}{\partial w_0}\frac{\partial \Phi(t)}{\partial t} }_{t = b_1 + w_1\Phi(b_0 + w_0x)}\\
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot 
\underbrace{ \Phi(k)(1 - \Phi(k)) }_{k = b_0 +w_0x} \cdot w_1 \cdot x_i, \\
\frac{\partial \mbox{MSE-Loss}}{\partial b_0} 
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot \Phi(k)(1 - \Phi(k)).
\end{aligned}
$$

One can clearly see there is a linkage between the derivative of the weights in consecutive layers:

$$
\begin{aligned}
\frac{\partial \mbox{MSE-Loss}}{\partial w_1}
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot \Phi(t)(1 - \Phi(t)) \cdot z_i, \\
\frac{\partial \mbox{MSE-Loss}}{\partial w_0}
&= - \frac{1}{N}\sum_i (y_i - \hat{y}_i) \cdot 
\underbrace{\Phi(t)(1 - \Phi(t))}_{\hat{y}_i(1 - \hat{y}_i)} \cdot 
\underbrace{\Phi(k)(1 - \Phi(k)) \cdot w_1 \cdot x_i}_{\frac{\partial w_1z}{\partial w_0} = \frac{\partial w_1\Phi(b0 + w_0x)}{\partial w_0}}.
\end{aligned}
$$

### Cross-Entropy Loss

Similarly we can derive the gradients for cross-entropy loss.

$$
\begin{aligned}
\mbox{LogLoss} 
&= - \frac{1}{N}\sum_i^N \bigg[ y_i\ln\hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\bigg], \\
\frac{\partial \mbox{LogLoss}}{\partial w_1}
&= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)z_i, \\
\frac{\partial \mbox{LogLoss}}{\partial w_0}
&= - \frac{1}{N} \sum_i (y_i - \hat{y}_i)\Phi(k)(1 - \Phi(k))w_1x_i.\\
\end{aligned}
$$

(Skip the bias terms to save space.)

Now let's implement the simple neural network model in Python:

```{python nn}
def sigmoid(x):
  return np.exp(-np.logaddexp(0, -x))


def sigmoid_grad(z):
  return z*(1 - z)


class NeuralNetwork:
  """Simple neural network with 1 hidden layer of 4 neurons."""
  def __init__(self, x, y):
    self.input = x
    self.w1 = np.random.rand(self.input.shape[1], 4)
    self.w2 = np.random.rand(4, 1)
    self.y = y
    self.output = np.zeros(y.shape)

  def feedforward(self):
    self.layer1 = sigmoid(np.dot(self.input, self.w1))
    self.output = sigmoid(np.dot(self.layer1, self.w2))

  def backprop(self, lr):
    raise NotImplementedError
  
  def train(self, lr, n_step):
    for step in range(n_step):
      self.feedforward()
      self.backprop(lr=lr)
      yhat = ",".join([str(y) for y in np.squeeze(self.output)])
    return yhat


class NerualNetworkWithMSELoss(NeuralNetwork):
  def backprop(self, lr):
    # We ignore all constant terms in the gradient.
    self.dy = self.y - self.output
    dw2 = - np.dot(self.layer1.T, (self.dy*sigmoid_grad(self.output)))
    dw1 = - np.dot(self.input.T, 
                   (np.dot(self.dy*sigmoid_grad(self.output),
                           self.w2.T)*sigmoid_grad(self.layer1)))
    self.w1 -= lr * dw1
    self.w2 -= lr * dw2


class NerualNetworkWithLogLoss(NeuralNetwork):
  def backprop(self, lr):
    # We ignore all constant terms in the gradient.
    self.dy = self.y - self.output
    dw2 = - np.dot(self.layer1.T, self.dy)
    dw1 = - np.dot(self.input.T, 
                   np.dot(self.dy, self.w2.T)*sigmoid_grad(self.layer1))
    self.w1 -= lr * dw1
    self.w2 -= lr * dw2
```

Here we will use the hello-world example of artificial neural network:
The [XOR](https://en.wikipedia.org/wiki/XOR_gate) problem.
The XOR logical outcome,
besides extremely simple,
is not linearly separable.
So it serves as a good example of showcasing neural networks' non-linearity.

The input data is simply combination of two binary switches.
(For completeness we also include a constant term which always evaluate to 1 as the first feature.)
The output data is the XOR result.

```{python nn_xor_data}
X = np.array(
    [[1, 0, 0],
     [1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

print(X)  # Input features: Bias + two binary switches.
print(y)  # XOR output.
```

Now let's see if our simple neural network model can learn the XOR pattern.

```{python nn_gd_mse_loss}
nn_mse = NerualNetworkWithMSELoss(X, y)
nn_mse.train(lr=1, n_step=3000)  # The predicted XOR outcome.
print(nn_mse.w1)  # Estimated 1st-layer model weights
print(nn_mse.w2)  # Estimated 2nd-layer model weights
```

```{python nn_gd_log_loss}
nn_logloss = NerualNetworkWithLogLoss(X, y)
nn_logloss.train(lr=1, n_step=3000)
```

For this simple example,
both MSE and cross-entropy loss can work fine to figure out the XOR pattern.
(Strictly speaking cross-entropy loss performs better.)

## Regularization

### L2-Norm

### L1-Norm

### Dropout


